
Problem Statement¶
Business Context¶
There is huge demand for used cars in the Indian market today. As sales of new cars have slowed down in the recent past, the pre-owned car market has continued to grow over the past years and is now larger than the new car market. Cars4U is a budding tech start-up that aims to find a foothold in this market.
In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. There is a slowdown in new car sales, and that could mean that demand is shifting towards the pre-owned market. In fact, some car owners replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), except for dealership-level discounts which come into play only in the last stage of the customer journey, used cars are very different beasts with huge uncertainty in both pricing and supply. Keeping this in mind, the pricing scheme of these used cars becomes important in order to grow in the market.
As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.
Objective¶
To explore and visualize the dataset, build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business.
Data Description¶
The data contains the different attributes of used cars sold in different locations. The detailed data dictionary is given below.
- Brand: brand name of the car
- Model Name: model name of the car
- Location: Location in which the car is being sold or is available for purchase (cities)
- Year: Manufacturing year of the car
- Kilometers_driven: The total kilometers driven in the car by the previous owner(s) in km
- Fuel_Type: The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)
- Transmission: The type of transmission used by the car (Automatic/Manual)
- Owner_Type: Type of ownership
- Mileage: The standard mileage offered by the car company in kmpl or km/kg
- Engine: The displacement volume of the engine in CC
- Power: The maximum power of the engine in bhp
- Seats: The number of seats in the car
- New_Price: The price of a new car of the same model in INR Lakhs (1 Lakh = 100,000 INR)
- Price: The price of the used car in INR Lakhs
Installing and Importing necessary libraries¶
# Installing the libraries with the specified version
!pip install --no-deps tensorflow==2.18.0 scikit-learn==1.3.2 matplotlib===3.8.3 seaborn==0.13.2 numpy==1.26.4 pandas==2.2.2 -q --user --no-warn-script-location
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
Note:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
import time
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to split the data into train and test
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import tensorflow as tf #An end-to-end open source machine learning platform
from tensorflow import keras # High-level neural networks API for deep learning.
from keras import backend # Abstraction layer for neural network backend engines.
from keras.models import Sequential # Model for building NN sequentially.
from keras.layers import Dense
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Set the seed using keras.utils.set_random_seed. This will set:
# 1) `numpy` seed
# 2) backend random seed
# 3) `python` random seed
keras.utils.set_random_seed(812)
# If using TensorFlow, this will make GPU ops as deterministic as possible,
# but it will affect the overall performance, so be mindful of that.
tf.config.experimental.enable_op_determinism()
Loading the dataset¶
# uncomment and run the following lines in case Google Colab is being used
# from google.colab import drive
# drive.mount('/content/drive')
# loading the dataset
data = pd.read_csv("used_cars_data.csv")
Data Overview¶
Displaying the first few rows of the dataset¶
data.head()
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Seats | New_Price | Price | mileage_num | engine_num | power_num | Brand | Model | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000.0 | CNG | Manual | First | 5.0 | 5.51 | 1.75 | 26.60 | 998.0 | 58.16 | maruti | wagon |
| 1 | Pune | 2015 | 41000.0 | Diesel | Manual | First | 5.0 | 16.06 | 12.50 | 19.67 | 1582.0 | 126.20 | hyundai | creta |
| 2 | Chennai | 2011 | 46000.0 | Petrol | Manual | First | 5.0 | 8.61 | 4.50 | 18.20 | 1199.0 | 88.70 | honda | jazz |
| 3 | Chennai | 2012 | 87000.0 | Diesel | Manual | First | 7.0 | 11.27 | 6.00 | 20.77 | 1248.0 | 88.76 | maruti | ertiga |
| 4 | Coimbatore | 2013 | 40670.0 | Diesel | Automatic | Second | 5.0 | 53.14 | 17.74 | 15.20 | 1968.0 | 140.80 | audi | a4 |
Checking the shape of the dataset¶
# checking shape of the data
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")
There are 7252 rows and 14 columns.
Checking 10 random rows of the dataset¶
# let's view a sample of the data
data.sample(n=10, random_state=1)
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Seats | New_Price | Price | mileage_num | engine_num | power_num | Brand | Model | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2397 | Kolkata | 2016 | 21460.0 | Petrol | Manual | First | 5.0 | 9.470 | 6.00 | 17.00 | 1497.0 | 121.36 | ford | ecosport |
| 6218 | Kolkata | 2013 | 48000.0 | Diesel | Manual | First | 5.0 | 7.880 | NaN | 23.40 | 1248.0 | 74.00 | maruti | swift |
| 6737 | Mumbai | 2015 | 59500.0 | Petrol | Manual | First | 7.0 | 13.580 | NaN | 17.30 | 1497.0 | 117.30 | honda | mobilio |
| 3659 | Delhi | 2015 | 27000.0 | Petrol | Automatic | First | 5.0 | 9.600 | 5.95 | 19.00 | 1199.0 | 88.70 | honda | jazz |
| 4513 | Bangalore | 2015 | 19000.0 | Diesel | Automatic | Second | 5.0 | 69.675 | 38.00 | 16.36 | 2179.0 | 187.70 | jaguar | xf |
| 599 | Coimbatore | 2019 | 40674.0 | Diesel | Automatic | First | 7.0 | 28.050 | 24.82 | 11.36 | 2755.0 | 171.50 | toyota | innova |
| 186 | Bangalore | 2014 | 37382.0 | Diesel | Automatic | First | 5.0 | 86.970 | 32.00 | 13.00 | 2143.0 | 201.10 | mercedes-benz | e-class |
| 305 | Kochi | 2014 | 61726.0 | Diesel | Automatic | First | 5.0 | 67.100 | 20.77 | 17.68 | 1968.0 | 174.33 | audi | a6 |
| 4581 | Hyderabad | 2013 | 105000.0 | Diesel | Automatic | First | 5.0 | 44.800 | 19.00 | 17.32 | 1968.0 | 150.00 | audi | q3 |
| 6616 | Delhi | 2014 | 55000.0 | Diesel | Automatic | First | 5.0 | 49.490 | NaN | 11.78 | 2143.0 | 167.62 | mercedes-benz | new |
Observations
# let's create a copy of the data to avoid any changes to original data
df = data.copy()
Checking the data types of the columns for the dataset¶
# checking column datatypes and number of non-null values
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7252 entries, 0 to 7251 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Location 7252 non-null object 1 Year 7252 non-null int64 2 Kilometers_Driven 7251 non-null float64 3 Fuel_Type 7252 non-null object 4 Transmission 7252 non-null object 5 Owner_Type 7252 non-null object 6 Seats 7199 non-null float64 7 New_Price 7252 non-null float64 8 Price 6019 non-null float64 9 mileage_num 7169 non-null float64 10 engine_num 7206 non-null float64 11 power_num 7077 non-null float64 12 Brand 7252 non-null object 13 Model 7252 non-null object dtypes: float64(7), int64(1), object(6) memory usage: 793.3+ KB
Observations
- 6 columns are of object type and 8 columns are numeric (7 of float64 type and 1 of int64 type)
Checking for duplicate values¶
# checking for duplicate values
df.duplicated().sum()
2
- There are two duplicate values in the data.
- Let's take a closer look at them.
df[df.duplicated(keep=False) == True]
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Seats | New_Price | Price | mileage_num | engine_num | power_num | Brand | Model | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3623 | Hyderabad | 2007 | 52195.0 | Petrol | Manual | First | 5.0 | 4.36 | 1.75 | 19.7 | 796.0 | 46.3 | maruti | alto |
| 4781 | Hyderabad | 2007 | 52195.0 | Petrol | Manual | First | 5.0 | 4.36 | 1.75 | 19.7 | 796.0 | 46.3 | maruti | alto |
| 6940 | Kolkata | 2017 | 13000.0 | Diesel | Manual | First | 5.0 | 13.58 | NaN | 26.0 | 1498.0 | 98.6 | honda | city |
| 7077 | Kolkata | 2017 | 13000.0 | Diesel | Manual | First | 5.0 | 13.58 | NaN | 26.0 | 1498.0 | 98.6 | honda | city |
Observations
- There is a good chance that two cars of the same build were sold in the same location.
- But it is highly unlikely that both of them will have the same number of kilometers driven.
- So, we will drop one row from each duplicated pair.
df.drop(4781, inplace=True)
df.drop(6940, inplace=True)
# checking for duplicate values
df.duplicated().sum()
0
- There are no duplicate values
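As a side note, the same deduplication can be done in one call with pandas' `drop_duplicates`, which keeps the first occurrence of each duplicated row and drops the rest. A minimal sketch on toy data (the values are illustrative):

```python
import pandas as pd

# toy frame with one duplicated row
toy = pd.DataFrame({
    "Brand": ["maruti", "maruti", "honda"],
    "Kilometers_Driven": [52195.0, 52195.0, 13000.0],
})

# keep="first" retains the first occurrence and drops later copies
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
# deduped now has 2 rows and no duplicates
```

Dropping by explicit index labels, as done above, works just as well here but requires knowing the labels in advance.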
Checking for missing values¶
df.isnull().sum()
| 0 | |
|---|---|
| Location | 0 |
| Year | 0 |
| Kilometers_Driven | 1 |
| Fuel_Type | 0 |
| Transmission | 0 |
| Owner_Type | 0 |
| Seats | 53 |
| New_Price | 0 |
| Price | 1232 |
| mileage_num | 83 |
| engine_num | 46 |
| power_num | 175 |
| Brand | 0 |
| Model | 0 |
- There are missing values in `Kilometers_Driven`, `Seats`, `Price`, `mileage_num`, `engine_num`, and `power_num`, which can be treated in data pre-processing
- We will drop the rows where `Price` is missing, as it is the target variable, before splitting the data into train and test
Note: The EDA section has been covered in detail in the previous case studies. In this case study, we will mainly focus on the model building aspects. We will only be looking at the key observations from EDA. The detailed EDA can be found in the appendix section.
The below functions need to be defined to carry out the Exploratory Data Analysis.
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
Univariate Analysis¶
# creating a copy of the dataframe
df1 = df.copy()
Price¶
histogram_boxplot(df1, "Price", kde=True)
Observations
- This is a highly skewed distribution.
New_Price¶
histogram_boxplot(df1, "New_Price", kde=True)
Observations
- This is another highly skewed distribution.
Brand¶
labeled_barplot(df1, "Brand", perc=True, n=10)
- Most of the cars in the data belong to Maruti or Hyundai. The price of used cars is lower for budget brands like Maruti, Tata, Fiat, etc. The price of used cars is higher for premium brands like Porsche, Bentley, Lamborghini, etc.
Location¶
labeled_barplot(df1, "Location", perc=True)
- Hyderabad and Mumbai account for the largest shares of used car listings in the data.
Fuel_Type¶
labeled_barplot(df1, "Fuel_Type", perc=True)
- Around 1% of the cars in the dataset do not run on diesel or petrol.
Bivariate Analysis¶
Correlation Check¶
plt.figure(figsize=(15, 7))
sns.heatmap(
df1.corr(numeric_only = True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
Observations
- `Power` and `Engine` are important predictors of used car price, but they are also highly correlated with each other.
- The price of a new car of the same model seems to be an important predictor of the used car price, which makes sense.
Price vs Location¶
plt.figure(figsize=(12, 5))
sns.boxplot(x="Location", y="Price", data=df1)
plt.show()
- The price of used cars has a large IQR in Coimbatore and Bangalore.
Price vs Brand¶
plt.figure(figsize=(18, 5))
sns.boxplot(x="Brand", y="Price", data=df)
plt.xticks(rotation=90)
plt.show()
- The price of used cars is lower for budget brands like Maruti, Tata, Fiat, etc.
- The price of used cars is higher for premium brands like Porsche, Audi, Lamborghini, etc.
Price vs Year¶
plt.figure(figsize=(18, 5))
sns.boxplot(x="Year", y="Price", data=df1)
plt.show()
- The price of used cars has increased over the years.
Data Preprocessing¶
Missing Value Treatment¶
- Let's drop the rows having NaN in the `Price` column, which is our target column.
# considering only the data points where price is not missing
df = df[df["Price"].notna()].copy()
# checking for missing values
df.isnull().sum()
| 0 | |
|---|---|
| Location | 0 |
| Year | 0 |
| Kilometers_Driven | 1 |
| Fuel_Type | 0 |
| Transmission | 0 |
| Owner_Type | 0 |
| Seats | 42 |
| New_Price | 0 |
| Price | 0 |
| mileage_num | 70 |
| engine_num | 36 |
| power_num | 143 |
| Brand | 0 |
| Model | 0 |
Encoding the categorical variables¶
df.dtypes
| 0 | |
|---|---|
| Location | object |
| Year | int64 |
| Kilometers_Driven | float64 |
| Fuel_Type | object |
| Transmission | object |
| Owner_Type | object |
| Seats | float64 |
| New_Price | float64 |
| Price | float64 |
| mileage_num | float64 |
| engine_num | float64 |
| power_num | float64 |
| Brand | object |
| Model | object |
data_car = df[['Brand', 'Model']].copy()
df = pd.get_dummies(df,
columns=df.select_dtypes(include=["object","int64"]).columns.tolist(),
drop_first=True,dtype=int
)
# Adding Brand and Model which is stored in data_car variable
# These will be needed during missing value imputation
df_final = pd.concat([df,data_car], axis=1)
df_final.shape
(6018, 287)
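To see what `pd.get_dummies` with `drop_first=True` does to a categorical column, here is a minimal sketch on a toy column (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "CNG"]})

# drop_first=True drops the first category level (alphabetically "CNG"),
# avoiding perfect multicollinearity among the dummies (the dummy variable trap);
# dtype=int gives 0/1 columns instead of booleans
encoded = pd.get_dummies(toy, columns=["Fuel_Type"], drop_first=True, dtype=int)
# encoded.columns -> ['Fuel_Type_Diesel', 'Fuel_Type_Petrol']
```

A row with all dummies equal to 0 then represents the dropped level (here, CNG).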
df_final.head()
| Kilometers_Driven | Seats | New_Price | Price | mileage_num | engine_num | power_num | Location_Bangalore | Location_Chennai | Location_Coimbatore | Location_Delhi | Location_Hyderabad | Location_Jaipur | Location_Kochi | Location_Kolkata | Location_Mumbai | Location_Pune | Year_1999 | Year_2000 | Year_2001 | Year_2002 | Year_2003 | Year_2004 | Year_2005 | Year_2006 | Year_2007 | Year_2008 | Year_2009 | Year_2010 | Year_2011 | Year_2012 | Year_2013 | Year_2014 | Year_2015 | Year_2016 | Year_2017 | Year_2018 | Year_2019 | Fuel_Type_Diesel | Fuel_Type_Electric | Fuel_Type_LPG | Fuel_Type_Petrol | Transmission_Manual | Owner_Type_Fourth & Above | Owner_Type_Second | Owner_Type_Third | Brand_audi | Brand_bentley | Brand_bmw | Brand_chevrolet | Brand_datsun | Brand_fiat | Brand_force | Brand_ford | Brand_honda | Brand_hyundai | Brand_isuzu | Brand_jaguar | Brand_jeep | Brand_lamborghini | Brand_land | Brand_mahindra | Brand_maruti | Brand_mercedes-benz | Brand_mini | Brand_mitsubishi | Brand_nissan | Brand_porsche | Brand_renault | Brand_skoda | Brand_smart | Brand_tata | Brand_toyota | Brand_volkswagen | Brand_volvo | Model_1000 | Model_3 | Model_5 | Model_6 | Model_7 | Model_800 | Model_a | Model_a-star | Model_a3 | Model_a4 | Model_a6 | Model_a7 | Model_a8 | Model_accent | Model_accord | Model_alto | Model_amaze | Model_ameo | Model_aspire | Model_aveo | Model_avventura | Model_b | Model_baleno | Model_beat | Model_beetle | Model_bolero | Model_bolt | Model_boxster | Model_br-v | Model_brio | Model_brv | Model_c-class | Model_camry | Model_captiva | Model_captur | Model_cayenne | Model_cayman | Model_cedia | Model_celerio | Model_ciaz | Model_city | Model_civic | Model_cla | Model_classic | Model_cls-class | Model_clubman | Model_compass | Model_continental | Model_cooper | Model_corolla | Model_countryman | Model_cr-v | Model_creta | Model_crosspolo | Model_cruze | Model_d-max | Model_duster | Model_dzire | Model_e | Model_e-class | Model_ecosport | Model_eeco | 
Model_elantra | Model_elite | Model_endeavour | Model_enjoy | Model_eon | Model_ertiga | Model_esteem | Model_estilo | Model_etios | Model_evalia | Model_f | Model_fabia | Model_fiesta | Model_figo | Model_fluence | Model_fortuner | Model_fortwo | Model_freestyle | Model_fusion | Model_gallardo | Model_getz | Model_gl-class | Model_gla | Model_glc | Model_gle | Model_gls | Model_go | Model_grand | Model_grande | Model_hexa | Model_i10 | Model_i20 | Model_ignis | Model_ikon | Model_indica | Model_indigo | Model_innova | Model_jazz | Model_jeep | Model_jetta | Model_koleos | Model_kuv | Model_kwid | Model_lancer | Model_laura | Model_linea | Model_lodgy | Model_logan | Model_m-class | Model_manza | Model_micra | Model_mobilio | Model_montero | Model_mustang | Model_mux | Model_nano | Model_new | Model_nexon | Model_nuvosport | Model_octavia | Model_omni | Model_one | Model_optra | Model_outlander | Model_pajero | Model_panamera | Model_passat | Model_petra | Model_platinum | Model_polo | Model_prius | Model_pulse | Model_punto | Model_q3 | Model_q5 | Model_q7 | Model_qualis | Model_quanto | Model_r-class | Model_rapid | Model_redi | Model_redi-go | Model_renault | Model_ritz | Model_rover | Model_rs5 | Model_s | Model_s-class | Model_s-cross | Model_s60 | Model_s80 | Model_safari | Model_sail | Model_santa | Model_santro | Model_scala | Model_scorpio | Model_siena | Model_sl-class | Model_slc | Model_slk-class | Model_sonata | Model_spark | Model_ssangyong | Model_sumo | Model_sunny | Model_superb | Model_swift | Model_sx4 | Model_tavera | Model_teana | Model_terrano | Model_thar | Model_tiago | Model_tigor | Model_tiguan | Model_tt | Model_tucson | Model_tuv | Model_v40 | Model_vento | Model_venture | Model_verito | Model_verna | Model_versa | Model_vitara | Model_wagon | Model_wr-v | Model_wrv | Model_x-trail | Model_x1 | Model_x3 | Model_x5 | Model_x6 | Model_xc60 | Model_xc90 | Model_xcent | Model_xe | Model_xenon | Model_xf | Model_xj | Model_xuv300 | 
Model_xuv500 | Model_xylo | Model_yeti | Model_z4 | Model_zen | Model_zest | Brand | Model | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72000.0 | 5.0 | 5.51 | 1.75 | 26.60 | 998.0 | 58.16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | maruti | wagon |
| 1 | 41000.0 | 5.0 | 16.06 | 12.50 | 19.67 | 1582.0 | 126.20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | hyundai | creta |
| 2 | 46000.0 | 5.0 | 8.61 | 4.50 | 18.20 | 1199.0 | 88.70 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | honda | jazz |
| 3 | 87000.0 | 7.0 | 11.27 | 6.00 | 20.77 | 1248.0 | 88.76 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | maruti | ertiga |
| 4 | 40670.0 | 5.0 | 53.14 | 17.74 | 15.20 | 1968.0 | 140.80 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | audi | a4 |
Train Test Split¶
# defining the dependent and independent variables
X = df_final.drop(["Price"], axis=1)
y = df_final["Price"]
# splitting the data in 80:20 ratio for train and temporary data
x_train, x_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2,random_state=1)
# splitting the temporary data in 50:50 ratio for validation and test data
x_val,x_test,y_val,y_test = train_test_split(x_temp,y_temp,test_size=0.5,random_state=1)
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in validation data =", x_val.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 4814 Number of rows in validation data = 602 Number of rows in test data = 602
Missing Value Treatment¶
def print_missing_values_columns(df):
"""
Filters and prints only the columns from the DataFrame df that contain missing values.
Parameters:
- df: DataFrame
The DataFrame to check for missing values.
"""
missing_values_columns = df.columns[df.isnull().any()]
missing_values_counts = df[missing_values_columns].isnull().sum()
print(missing_values_counts)
# train data
print_missing_values_columns(x_train)
Kilometers_Driven 1 Seats 39 mileage_num 59 engine_num 34 power_num 116 dtype: int64
# validation data
print_missing_values_columns(x_val)
Seats 1 mileage_num 5 power_num 13 dtype: int64
# test data
print_missing_values_columns(x_test)
Seats 2 mileage_num 6 engine_num 2 power_num 14 dtype: int64
We'll impute these missing values column by column, using the median value for the particular car's `Brand` and `Model` combination computed on the train set, starting with `Seats`.
# first, we calculate the median of Seats in the train set grouped by Brand and Model and store in train_grouped_median
train_grouped_median = x_train.groupby(["Brand", "Model"])["Seats"].median()
train_grouped_median
| Seats | ||
|---|---|---|
| Brand | Model | |
| ambassador | classic | 5.0 |
| audi | a3 | 5.0 |
| a4 | 5.0 | |
| a6 | 5.0 | |
| a7 | 5.0 | |
| ... | ... | ... |
| volvo | s60 | 5.0 |
| s80 | 5.0 | |
| v40 | 5.0 | |
| xc60 | 5.0 | |
| xc90 | 7.0 |
209 rows × 1 columns
Working of the above code
- It groups the training dataset `x_train` by the columns `Brand` and `Model`
- Within each group, it selects the `Seats` column
- Then, it calculates the median of the `Seats` column for each group
- This step effectively creates a mapping of the median number of seats for each unique combination of `Brand` and `Model`
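The steps above can be illustrated on a toy frame (values are made up for illustration):

```python
import pandas as pd

# toy frame: two maruti altos with different seat counts, one audi q3
toy = pd.DataFrame({
    "Brand": ["maruti", "maruti", "audi"],
    "Model": ["alto", "alto", "q3"],
    "Seats": [4.0, 5.0, 7.0],
})

# Series indexed by the (Brand, Model) pair -> median Seats of that group
grouped_median = toy.groupby(["Brand", "Model"])["Seats"].median()
# grouped_median.get(("maruti", "alto")) -> 4.5
# grouped_median.get(("audi", "q3"))    -> 7.0
```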
# we will use the calculated median (train_grouped_median) to fill missing values in Seats for corresponding groups in the train set
x_train["Seats"] = x_train.apply(lambda row: row["Seats"] if not pd.isna(row["Seats"]) else train_grouped_median.get((row["Brand"], row["Model"]), np.nan), axis=1)
Working of the above code

For each row in the training dataset x_train:

- It checks whether the value in the `Seats` column (`row["Seats"]`) is NaN using `pd.isna(row["Seats"])`
- If the value is not NaN (i.e., it's not missing), it keeps the original value (`row["Seats"]`)
- If the value is NaN (missing), it uses `train_grouped_median.get((row["Brand"], row["Model"]), np.nan)` to fetch the median value for the corresponding `Brand` and `Model` combination from the `train_grouped_median` mapping created previously
  - If there's no corresponding median value (i.e., the combination of `Brand` and `Model` doesn't exist in `train_grouped_median`), it assigns NaN (`np.nan`)

This step essentially fills missing values in the `Seats` column of the training dataset `x_train` using the median values calculated from the training dataset. It ensures that the imputation is done based on the specific `Brand` and `Model` combination, preserving the relationship between these features and the `Seats` column.
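When imputing within the train set itself, a vectorized alternative to the row-wise `apply` is `groupby(...).transform("median")`, which broadcasts each group's median back to its rows; `fillna` then uses it only where `Seats` is missing. A sketch on toy data (for the validation and test sets, the precomputed `train_grouped_median` mapping is still the right tool, since their fill values must come from the train set):

```python
import numpy as np
import pandas as pd

# toy stand-in for x_train with missing Seats values
x_toy = pd.DataFrame({
    "Brand": ["maruti", "maruti", "audi"],
    "Model": ["alto", "alto", "q3"],
    "Seats": [5.0, np.nan, np.nan],
})

# transform("median") computes each (Brand, Model) group's median (NaNs
# are skipped) and aligns it with the original rows; fillna replaces only NaNs
x_toy["Seats"] = x_toy["Seats"].fillna(
    x_toy.groupby(["Brand", "Model"])["Seats"].transform("median")
)
# the audi/q3 group has no observed Seats value, so its row stays NaN
```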
# checking data points where Seats is still missing
x_train[x_train["Seats"].isnull()]
|      | Kilometers_Driven | Seats | New_Price | mileage_num | engine_num | power_num | ... | Brand | Model |
|---|---|---|---|---|---|---|---|---|---|
| 2369 | 56000.0 | NaN | 7.88 | 19.5 | 1061.0 | NaN | ... | maruti | estilo |
| 5893 | 51000.0 | NaN | 7.88 | 19.5 | 1061.0 | NaN | ... | maruti | estilo |

*(one-hot Location, Year, Fuel_Type, Transmission, Owner_Type, Brand, and Model columns omitted for readability)*
- Maruti Estilo can accommodate 5 people.
x_train["Seats"] = x_train["Seats"].fillna(5.0)
# we will use the calculated median (train_grouped_median) to fill missing values in Seats for corresponding groups in the validation set
x_val["Seats"] = x_val.apply(lambda row: row["Seats"] if not pd.isna(row["Seats"]) else train_grouped_median.get((row["Brand"], row["Model"]), np.nan), axis=1)
- The above code does the same operation as the one previously used for imputing missing values
- The only difference is that it operates on the validation set (x_val) instead of the training set (x_train)
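The pattern above (compute group medians on the training data only, then look them up when filling the validation data) can be sketched on toy data. All values and names here are hypothetical, standing in for `x_train` and `x_val`:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for x_train / x_val (hypothetical values)
train = pd.DataFrame({
    "Brand": ["maruti", "maruti", "honda", "honda"],
    "Model": ["swift", "swift", "city", "city"],
    "Seats": [5.0, np.nan, 5.0, 5.0],
})
val = pd.DataFrame({
    "Brand": ["maruti", "honda"],
    "Model": ["swift", "city"],
    "Seats": [np.nan, np.nan],
})

# Medians are computed on the training data only
grouped_median = train.groupby(["Brand", "Model"])["Seats"].median()

# Validation NaNs are filled with the train-derived group medians
val["Seats"] = val.apply(
    lambda row: row["Seats"]
    if not pd.isna(row["Seats"])
    else grouped_median.get((row["Brand"], row["Model"]), np.nan),
    axis=1,
)
print(val["Seats"].tolist())  # [5.0, 5.0]
```

Using only train-derived statistics for the validation and test sets avoids leaking information from those sets into the preprocessing step.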
# checking the missing values in x_val
print_missing_values_columns(x_val)
Seats          1
mileage_num    5
power_num     13
dtype: int64
# checking data points where Seats is still missing
x_val[x_val["Seats"].isnull()]
|      | Kilometers_Driven | Seats | New_Price | mileage_num | engine_num | power_num | ... | Brand | Model |
|---|---|---|---|---|---|---|---|---|---|
| 3882 | 40000.0 | NaN | 7.88 | 19.5 | 1061.0 | NaN | ... | maruti | estilo |

*(one-hot Location, Year, Fuel_Type, Transmission, Owner_Type, Brand, and Model columns omitted for readability)*
- Maruti Estilo can accommodate 5 people.
x_val["Seats"] = x_val["Seats"].fillna(5.0)
# checking the missing values in x_val
print_missing_values_columns(x_val)
mileage_num     5
power_num      13
dtype: int64
# Same method is applied on test data
x_test["Seats"] = x_test.apply(lambda row: row["Seats"] if not pd.isna(row["Seats"]) else train_grouped_median.get((row["Brand"], row["Model"]), np.nan), axis=1)
# checking the missing values in x_test
print_missing_values_columns(x_test)
mileage_num     6
engine_num      2
power_num      14
dtype: int64
We will use a similar method to fill missing values for the Kilometers_Driven, mileage_num, engine_num, and power_num columns.
cols_list = ["Kilometers_Driven","mileage_num", "engine_num", "power_num"]
# Step 1: Calculate the median of specified columns in x_train grouped by Brand and Model
train_grouped_median = x_train.groupby(["Brand", "Model"])[cols_list].median()
# Step 2: Use the calculated median to fill missing values in specified columns for corresponding groups in train, validation and test data
for col in cols_list:
    x_train[col] = x_train.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"], row["Model"]), np.nan), axis=1)
    x_val[col] = x_val.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"], row["Model"]), np.nan), axis=1)
    x_test[col] = x_test.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"], row["Model"]), np.nan), axis=1)
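The row-wise `apply` above is easy to read but slow on large frames. A vectorized equivalent, sketched here on toy data (hypothetical values), looks up each row's group median via a MultiIndex and fills only the missing entries:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one of the datasets (hypothetical values)
df = pd.DataFrame({
    "Brand": ["maruti", "maruti", "honda"],
    "Model": ["swift", "swift", "city"],
    "power_num": [80.0, np.nan, 118.0],
})

# Group medians, as computed from the training data
group_median = df.groupby(["Brand", "Model"])["power_num"].median()

# Look up each row's (Brand, Model) median, then fill only the NaNs
keys = pd.MultiIndex.from_frame(df[["Brand", "Model"]])
fill_values = pd.Series(group_median.reindex(keys).to_numpy(), index=df.index)
df["power_num"] = df["power_num"].fillna(fill_values)
print(df["power_num"].tolist())  # [80.0, 80.0, 118.0]
```

The result is identical to the lambda-based version; only the lookup mechanism differs.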
# checking the missing values in x_train
print_missing_values_columns(x_train)
mileage_num    7
power_num      9
dtype: int64
# checking the missing values in x_val
print_missing_values_columns(x_val)
mileage_num    1
power_num      1
dtype: int64
# checking the missing values in x_test
print_missing_values_columns(x_test)
mileage_num    1
power_num      1
dtype: int64
- There are still some missing values in mileage_num and power_num.
- We'll impute these missing values by taking the median grouped by the Brand.
cols_list = ["mileage_num", "power_num"]
# Step 1: Calculate the median of specified columns in x_train grouped by Brand
train_grouped_median = x_train.groupby(["Brand"])[cols_list].median()
# Step 2: Use the calculated median to fill missing values in specified columns for corresponding groups in train, validation and test data
for col in cols_list:
    x_train[col] = x_train.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"]), np.nan), axis=1)
    x_val[col] = x_val.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"]), np.nan), axis=1)
    x_test[col] = x_test.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"]), np.nan), axis=1)
print_missing_values_columns(x_train)
mileage_num    1
power_num      1
dtype: int64
print_missing_values_columns(x_val)
Series([], dtype: float64)
print_missing_values_columns(x_test)
Series([], dtype: float64)
- There are still some missing values in the train data (mileage_num and power_num), while all missing values in the validation and test data are imputed.
- We'll impute the remaining train missing values using the column median across the entire data.
cols_list = ["mileage_num", "power_num"]
for col in cols_list:
    x_train[col] = x_train[col].fillna(df[col].median())
print_missing_values_columns(x_train)
Series([], dtype: float64)
- Missing values in all columns of x_train are imputed.
# Dropping Brand and Model from train, validation, and test data as we already have dummy variables for them
x_train = x_train.drop(['Brand','Model'],axis=1)
x_val = x_val.drop(['Brand','Model'],axis=1)
x_test = x_test.drop(['Brand','Model'],axis=1)
Normalizing the numerical variables¶
# Define the columns to scale
num_columns = ["Kilometers_Driven", "Seats", "New_Price", "mileage_num", "engine_num", "power_num"]
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit the scaler to the selected columns in the x_train data
scaler.fit(x_train[num_columns])
StandardScaler()
- Once the scaler object is fit on the data using the fit() method, it stores the parameters (mean and standard deviation) for normalization based on the training data
- We then use these parameters to normalize the validation and test data
- This is similar to what we did in the Missing Value Treatment section
- The only difference is that there we had to explicitly store the parameters (median values), while here sklearn stores them implicitly
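This fit/transform contract can be illustrated with a minimal sketch on toy numbers (not the actual dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[4.0]])

# fit() stores the train mean and standard deviation
scaler = StandardScaler().fit(train)
print(scaler.mean_, scaler.scale_)  # [2.] [0.81649658]

# transform() reuses the stored train parameters, even on unseen data:
# here (4 - 2) / 0.8165, not statistics computed from val itself
val_scaled = scaler.transform(val)
```

Because the validation value is scaled with the train statistics, the model sees validation data on exactly the same scale it was trained on.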
# Transform selected columns in x_train, x_val, and x_test using the fitted scaler
x_train[num_columns] = scaler.transform(x_train[num_columns])
x_val[num_columns] = scaler.transform(x_val[num_columns])
x_test[num_columns] = scaler.transform(x_test[num_columns])
x_train.head()
|      | Kilometers_Driven | Seats | New_Price | mileage_num | engine_num | power_num | ... |
|---|---|---|---|---|---|---|---|
| 4269 | -0.694078 | -0.351313 | -0.637638 | 1.136662 | -1.034356 | -0.841807 | ... |
| 2025 | -0.081329 | 2.126668 | -0.674075 | -0.765611 | -0.708133 | -0.731916 | ... |
| 5776 | -0.469629 | -0.351313 | 1.297640 | -0.287665 | 0.563805 | 1.136412 | ... |
| 1710 | -0.365282 | -0.351313 | -0.517681 | 0.732429 | -0.706486 | -0.545692 | ... |
| 2363 | -0.978527 | -0.351313 | -0.572951 | 0.137969 | -0.706486 | -0.565973 | ... |

*(one-hot Location, Year, Fuel_Type, Transmission, Owner_Type, Brand, and Model columns omitted for readability)*
Utility functions¶
def plot(history, name):
    """
    Function to plot a training metric against epochs
    history: the History object returned by model.fit(), which stores the metrics and losses
    name: the metric to plot, e.g., 'loss' or 'r2_score'
    """
    fig, ax = plt.subplots()  # creating a subplot with figure and axes
    plt.plot(history.history[name])  # plotting the train metric
    plt.plot(history.history['val_' + name])  # plotting the validation metric
    plt.title('Model ' + name.capitalize())  # defining the title of the plot
    plt.ylabel(name.capitalize())  # defining the label for the y-axis
    plt.xlabel('Epoch')  # defining the label for the x-axis
    fig.legend(['Train', 'Validation'], loc="outside right upper")  # defining the legend, loc controls its position
We'll create a dataframe to store the results from all the models we build
- We will be using metric functions defined in sklearn for RMSE, MAE, and $R^2$.
- We will define a function to calculate MAPE and adjusted $R^2$.
- We will create a function which will print out all the above metrics in one go.
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]  # number of observations
    k = predictors.shape[1]  # number of predictors
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100
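As a quick numeric check of the formulas above, here is a worked example on hypothetical targets and predictions (R-squared is computed by hand as $1 - SS_{res}/SS_{tot}$, which is what sklearn's r2_score returns):

```python
import numpy as np

# Hypothetical targets and predictions
targets = np.array([10.0, 20.0, 30.0, 40.0])
predictions = np.array([12.0, 18.0, 33.0, 41.0])

# R-squared = 1 - SS_res / SS_tot
ss_res = np.sum((targets - predictions) ** 2)     # 4 + 4 + 9 + 1 = 18
ss_tot = np.sum((targets - targets.mean()) ** 2)  # 500
r2 = 1 - ss_res / ss_tot                          # 0.964

# Adjusted R-squared penalizes for the number of predictors k
n, k = 4, 1  # 4 observations, 1 predictor (hypothetical)
adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1))   # 0.946

# MAPE: mean absolute percentage error
mape = np.mean(np.abs(targets - predictions) / targets) * 100  # ~10.6%
```

Note that adjusted R-squared is always at most R-squared, and the gap widens as more predictors are added relative to the number of observations.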
# function to compute different metrics to check performance of a neural network model
def model_performance(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance
    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors).reshape(-1)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE

    # creating a dictionary of metrics
    df_perf = {
        "RMSE": [rmse],
        "MAE": [mae],
        "R-squared": [r2],
        "Adj. R-squared": [adjr2],
        "MAPE": [mape],
    }
    return df_perf
columns = ["# hidden layers","# neurons - hidden layer","activation function - hidden layer","# epochs","batch size","optimizer","time(secs)","Train_loss","Valid_loss","Train_R-squared","Valid_R-squared"]
results = pd.DataFrame(columns=columns)
Model building¶
We'll use $R^2$ as our metric of choice for evaluating the models.
#Defining the list of metrics to be used for all the models.
metrics = [tf.keras.metrics.R2Score(name="r2_score")]
Model 0¶
- We will start off with a simple neural network with
    - no hidden layers
    - gradient descent as the optimization algorithm
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
#Initializing the neural network
model = Sequential()
model.add(Dense(1,input_dim=x_train.shape[1]))
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ dense (Dense) │ (None, 1) │ 285 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 285 (1.11 KB)
Trainable params: 285 (1.11 KB)
Non-trainable params: 0 (0.00 B)
optimizer = keras.optimizers.SGD()  # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
epochs = 10
batch_size = x_train.shape[0]  # full batch, so this is plain gradient descent
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/10 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 426ms/step - loss: 215.6193 - r2_score: -0.7024 - val_loss: 231.8262 - val_r2_score: -0.6387 Epoch 2/10 1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 517ms/step - loss: 199.6886 - r2_score: -0.5766 - val_loss: 215.1222 - val_r2_score: -0.5207 Epoch 3/10 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 219ms/step - loss: 185.3213 - r2_score: -0.4632 - val_loss: 200.0059 - val_r2_score: -0.4138 Epoch 4/10 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 302ms/step - loss: 172.3545 - r2_score: -0.3608 - val_loss: 186.3157 - val_r2_score: -0.3170 Epoch 5/10 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 288ms/step - loss: 160.6429 - r2_score: -0.2683 - val_loss: 173.9070 - val_r2_score: -0.2293 Epoch 6/10 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 229ms/step - loss: 150.0572 - r2_score: -0.1848 - val_loss: 162.6510 - val_r2_score: -0.1498 Epoch 7/10 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 136ms/step - loss: 140.4819 - r2_score: -0.1092 - val_loss: 152.4325 - val_r2_score: -0.0775 Epoch 8/10 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 153ms/step - loss: 131.8141 - r2_score: -0.0407 - val_loss: 143.1481 - val_r2_score: -0.0119 Epoch 9/10 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 268ms/step - loss: 123.9616 - r2_score: 0.0213 - val_loss: 134.7058 - val_r2_score: 0.0478 Epoch 10/10 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 134ms/step - loss: 116.8424 - r2_score: 0.0775 - val_loss: 127.0229 - val_r2_score: 0.1021
print("Time taken in seconds ",end-start)
Time taken in seconds 2.8553929328918457
plot(history,'loss')
plot(history,'r2_score')
results.loc[0]=['-','-','-',epochs,batch_size,'GD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
results
| | # hidden layers | # neurons - hidden layer | activation function - hidden layer | # epochs | batch size | optimizer | time(secs) | Train_loss | Valid_loss | Train_R-squared | Valid_R-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | - | - | 10 | 4814 | GD | 2.855393 | 116.842415 | 127.022896 | 0.077489 | 0.102093 |
- Since it's a very simple model trained with only 10 full-batch weight updates, the scores are poor.
Model 1¶
- Let's try increasing the epochs to check whether the performance is improving or not.
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
# Initializing the neural network
model = Sequential()
model.add(Dense(1, input_dim=x_train.shape[1]))
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ dense (Dense) │ (None, 1) │ 285 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 285 (1.11 KB)
Trainable params: 285 (1.11 KB)
Non-trainable params: 0 (0.00 B)
optimizer = keras.optimizers.SGD()  # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
epochs = 25
batch_size = x_train.shape[0]  # full batch, so this is plain gradient descent
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 268ms/step - loss: 212.9668 - r2_score: -0.5842 - val_loss: 229.4064 - val_r2_score: -0.6216 Epoch 2/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 167ms/step - loss: 197.4542 - r2_score: -0.5590 - val_loss: 213.0991 - val_r2_score: -0.5064 Epoch 3/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 159ms/step - loss: 183.4492 - r2_score: -0.4484 - val_loss: 198.3264 - val_r2_score: -0.4019 Epoch 4/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 123ms/step - loss: 170.7959 - r2_score: -0.3485 - val_loss: 184.9332 - val_r2_score: -0.3073 Epoch 5/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 163ms/step - loss: 159.3555 - r2_score: -0.2582 - val_loss: 172.7811 - val_r2_score: -0.2214 Epoch 6/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 310ms/step - loss: 149.0038 - r2_score: -0.1764 - val_loss: 161.7463 - val_r2_score: -0.1434 Epoch 7/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 125ms/step - loss: 139.6302 - r2_score: -0.1024 - val_loss: 151.7180 - val_r2_score: -0.0725 Epoch 8/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 140ms/step - loss: 131.1358 - r2_score: -0.0354 - val_loss: 142.5970 - val_r2_score: -0.0080 Epoch 9/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 158ms/step - loss: 123.4324 - r2_score: 0.0255 - val_loss: 134.2945 - val_r2_score: 0.0507 Epoch 10/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 127ms/step - loss: 116.4407 - r2_score: 0.0807 - val_loss: 126.7310 - val_r2_score: 0.1042 Epoch 11/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 163ms/step - loss: 110.0903 - r2_score: 0.1308 - val_loss: 119.8349 - val_r2_score: 0.1529 Epoch 12/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 284ms/step - loss: 104.3176 - r2_score: 0.1764 - val_loss: 113.5422 - val_r2_score: 0.1974 Epoch 13/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 150ms/step - loss: 99.0660 - r2_score: 0.2178 - val_loss: 107.7954 - val_r2_score: 0.2380 Epoch 14/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 268ms/step - loss: 94.2847 - r2_score: 0.2556 - val_loss: 102.5428 - val_r2_score: 0.2751 Epoch 15/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 139ms/step - loss: 89.9281 - r2_score: 0.2900 - val_loss: 97.7379 - val_r2_score: 0.3091 Epoch 16/25 1/1 
━━━━━━━━━━━━━━━━━━━━ 0s 142ms/step - loss: 85.9551 - r2_score: 0.3214 - val_loss: 93.3388 - val_r2_score: 0.3402 Epoch 17/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 154ms/step - loss: 82.3291 - r2_score: 0.3500 - val_loss: 89.3079 - val_r2_score: 0.3687 Epoch 18/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 296ms/step - loss: 79.0171 - r2_score: 0.3761 - val_loss: 85.6112 - val_r2_score: 0.3948 Epoch 19/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 270ms/step - loss: 75.9892 - r2_score: 0.4000 - val_loss: 82.2183 - val_r2_score: 0.4188 Epoch 20/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 143ms/step - loss: 73.2189 - r2_score: 0.4219 - val_loss: 79.1014 - val_r2_score: 0.4408 Epoch 21/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 134ms/step - loss: 70.6820 - r2_score: 0.4419 - val_loss: 76.2357 - val_r2_score: 0.4611 Epoch 22/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 126ms/step - loss: 68.3569 - r2_score: 0.4603 - val_loss: 73.5987 - val_r2_score: 0.4797 Epoch 23/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 148ms/step - loss: 66.2241 - r2_score: 0.4771 - val_loss: 71.1701 - val_r2_score: 0.4969 Epoch 24/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 306ms/step - loss: 64.2660 - r2_score: 0.4926 - val_loss: 68.9315 - val_r2_score: 0.5127 Epoch 25/25 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 280ms/step - loss: 62.4666 - r2_score: 0.5068 - val_loss: 66.8663 - val_r2_score: 0.5273
print("Time taken in seconds ",end-start)
Time taken in seconds 4.981976270675659
plot(history,'loss')
plot(history,'r2_score')
results.loc[1]=['-','-','-',epochs,batch_size,'GD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
results
| | # hidden layers | # neurons - hidden layer | activation function - hidden layer | # epochs | batch size | optimizer | time(secs) | Train_loss | Valid_loss | Train_R-squared | Valid_R-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | - | - | 10 | 4814 | GD | 2.855393 | 116.842415 | 127.022896 | 0.077489 | 0.102093 |
| 1 | - | - | - | 25 | 4814 | GD | 4.981976 | 62.466640 | 66.866333 | 0.506804 | 0.527331 |
- As expected, training for more epochs increases the $R^2$ (validation $R^2$ goes from ~0.10 to ~0.53).
Model 2¶
- Even though the previous model improved, the gain from one epoch to the next is small because full-batch gradient descent updates the weights only once per epoch.
- Let's now switch to mini-batch SGD to speed up learning.
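The key difference between the two regimes is the number of weight updates per epoch: full-batch gradient descent makes one update per epoch, while mini-batch SGD makes one per batch. A quick sketch with the 4814 training rows used here (the row count is taken from the batch size shown in the results table):

```python
import math

n_train = 4814  # training rows, matching the full-batch size in the results table

for batch_size in (n_train, 64, 32):
    steps = math.ceil(n_train / batch_size)  # weight updates per epoch
    print(f"batch_size={batch_size:>5} -> {steps} updates per epoch")
```

With batch size 32 this gives 151 updates per epoch, which matches the `151/151` step counter in the training log below.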
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
# Initializing the neural network
model = Sequential()
model.add(Dense(1, input_dim=x_train.shape[1]))
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ dense (Dense) │ (None, 1) │ 285 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 285 (1.11 KB)
Trainable params: 285 (1.11 KB)
Non-trainable params: 0 (0.00 B)
optimizer = keras.optimizers.SGD()  # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
epochs = 25
batch_size = 32
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 30ms/step - loss: 86.8465 - r2_score: 0.4432 - val_loss: 35.5213 - val_r2_score: 0.7489 Epoch 2/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 33.6671 - r2_score: 0.7365 - val_loss: 32.9409 - val_r2_score: 0.7671 Epoch 3/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 23ms/step - loss: 31.4131 - r2_score: 0.7541 - val_loss: 31.3458 - val_r2_score: 0.7784 Epoch 4/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - loss: 30.0467 - r2_score: 0.7648 - val_loss: 30.2675 - val_r2_score: 0.7860 Epoch 5/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 29.1110 - r2_score: 0.7721 - val_loss: 29.4811 - val_r2_score: 0.7916 Epoch 6/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 23ms/step - loss: 28.4153 - r2_score: 0.7775 - val_loss: 28.8686 - val_r2_score: 0.7959 Epoch 7/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 6s 29ms/step - loss: 27.8644 - r2_score: 0.7818 - val_loss: 28.3658 - val_r2_score: 0.7995 Epoch 8/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 23ms/step - loss: 27.4077 - r2_score: 0.7854 - val_loss: 27.9365 - val_r2_score: 0.8025 Epoch 9/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 6s 28ms/step - loss: 27.0161 - r2_score: 0.7884 - val_loss: 27.5594 - val_r2_score: 0.8052 Epoch 10/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 26.6721 - r2_score: 0.7911 - val_loss: 27.2215 - val_r2_score: 0.8076 Epoch 11/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 23ms/step - loss: 26.3648 - r2_score: 0.7935 - val_loss: 26.9147 - val_r2_score: 0.8097 Epoch 12/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 29ms/step - loss: 26.0867 - r2_score: 0.7957 - val_loss: 26.6333 - val_r2_score: 0.8117 Epoch 13/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 25.8324 - r2_score: 0.7976 - val_loss: 26.3735 - val_r2_score: 0.8136 Epoch 14/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 25.5982 - r2_score: 0.7995 - val_loss: 26.1323 - val_r2_score: 0.8153 Epoch 15/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 6s 30ms/step - loss: 25.3811 - r2_score: 0.8012 - val_loss: 25.9075 - val_r2_score: 0.8169 Epoch 16/25 151/151 
━━━━━━━━━━━━━━━━━━━━ 4s 27ms/step - loss: 25.1790 - r2_score: 0.8027 - val_loss: 25.6974 - val_r2_score: 0.8183 Epoch 17/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 27ms/step - loss: 24.9900 - r2_score: 0.8042 - val_loss: 25.5004 - val_r2_score: 0.8197 Epoch 18/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 24ms/step - loss: 24.8125 - r2_score: 0.8056 - val_loss: 25.3153 - val_r2_score: 0.8210 Epoch 19/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 24.6455 - r2_score: 0.8069 - val_loss: 25.1410 - val_r2_score: 0.8223 Epoch 20/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 29ms/step - loss: 24.4877 - r2_score: 0.8081 - val_loss: 24.9766 - val_r2_score: 0.8234 Epoch 21/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 23ms/step - loss: 24.3385 - r2_score: 0.8093 - val_loss: 24.8212 - val_r2_score: 0.8245 Epoch 22/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 23ms/step - loss: 24.1969 - r2_score: 0.8104 - val_loss: 24.6741 - val_r2_score: 0.8256 Epoch 23/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 6s 28ms/step - loss: 24.0624 - r2_score: 0.8114 - val_loss: 24.5347 - val_r2_score: 0.8266 Epoch 24/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 23.9343 - r2_score: 0.8124 - val_loss: 24.4023 - val_r2_score: 0.8275 Epoch 25/25 151/151 ━━━━━━━━━━━━━━━━━━━━ 6s 29ms/step - loss: 23.8121 - r2_score: 0.8134 - val_loss: 24.2764 - val_r2_score: 0.8284
print("Time taken in seconds ",end-start)
Time taken in seconds 116.03793096542358
plot(history,'loss')
plot(history,'r2_score')
results.loc[2]=['-','-','-',epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
results
| | # hidden layers | # neurons - hidden layer | activation function - hidden layer | # epochs | batch size | optimizer | time(secs) | Train_loss | Valid_loss | Train_R-squared | Valid_R-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | - | - | 10 | 4814 | GD | 2.855393 | 116.842415 | 127.022896 | 0.077489 | 0.102093 |
| 1 | - | - | - | 25 | 4814 | GD | 4.981976 | 62.466640 | 66.866333 | 0.506804 | 0.527331 |
| 2 | - | - | - | 25 | 32 | SGD | 116.037931 | 25.865023 | 24.276419 | 0.795787 | 0.828393 |
- After just the first epoch, the validation $R^2$ is already ~0.75, far better than full-batch training achieved in 25 epochs.
- The improvement in $R^2$ from epoch to epoch is also larger.
- Note that the training time has increased substantially because the model parameters are updated once per batch (151 times per epoch) instead of once per epoch.
Model 3¶
- Let's now increase the batch size to 64 to see if the performance improves.
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
# Initializing the neural network
model = Sequential()
model.add(Dense(1, input_dim=x_train.shape[1]))
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ dense (Dense) │ (None, 1) │ 285 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 285 (1.11 KB)
Trainable params: 285 (1.11 KB)
Non-trainable params: 0 (0.00 B)
optimizer = keras.optimizers.SGD()  # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
epochs = 25
batch_size = 64
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 27ms/step - loss: 111.9825 - r2_score: 0.3936 - val_loss: 38.6424 - val_r2_score: 0.7268 Epoch 2/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 26ms/step - loss: 36.7211 - r2_score: 0.7141 - val_loss: 35.5817 - val_r2_score: 0.7485 Epoch 3/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 34.1328 - r2_score: 0.7338 - val_loss: 34.1166 - val_r2_score: 0.7588 Epoch 4/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - loss: 32.7173 - r2_score: 0.7448 - val_loss: 33.0035 - val_r2_score: 0.7667 Epoch 5/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 31.6776 - r2_score: 0.7529 - val_loss: 32.1211 - val_r2_score: 0.7729 Epoch 6/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 24ms/step - loss: 30.8699 - r2_score: 0.7592 - val_loss: 31.4058 - val_r2_score: 0.7780 Epoch 7/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 30.2217 - r2_score: 0.7642 - val_loss: 30.8142 - val_r2_score: 0.7822 Epoch 8/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 29.6879 - r2_score: 0.7684 - val_loss: 30.3152 - val_r2_score: 0.7857 Epoch 9/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 29.2380 - r2_score: 0.7718 - val_loss: 29.8866 - val_r2_score: 0.7887 Epoch 10/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - loss: 28.8512 - r2_score: 0.7748 - val_loss: 29.5122 - val_r2_score: 0.7914 Epoch 11/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 26ms/step - loss: 28.5128 - r2_score: 0.7775 - val_loss: 29.1803 - val_r2_score: 0.7937 Epoch 12/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 28.2123 - r2_score: 0.7798 - val_loss: 28.8820 - val_r2_score: 0.7958 Epoch 13/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 26ms/step - loss: 27.9418 - r2_score: 0.7819 - val_loss: 28.6108 - val_r2_score: 0.7978 Epoch 14/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 27.6958 - r2_score: 0.7838 - val_loss: 28.3617 - val_r2_score: 0.7995 Epoch 15/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - loss: 27.4698 - r2_score: 0.7856 - val_loss: 28.1310 - val_r2_score: 0.8011 Epoch 16/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 25ms/step - 
loss: 27.2606 - r2_score: 0.7872 - val_loss: 27.9158 - val_r2_score: 0.8027 Epoch 17/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 27ms/step - loss: 27.0657 - r2_score: 0.7887 - val_loss: 27.7139 - val_r2_score: 0.8041 Epoch 18/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 26ms/step - loss: 26.8830 - r2_score: 0.7901 - val_loss: 27.5234 - val_r2_score: 0.8054 Epoch 19/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 26.7110 - r2_score: 0.7915 - val_loss: 27.3431 - val_r2_score: 0.8067 Epoch 20/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 36ms/step - loss: 26.5485 - r2_score: 0.7927 - val_loss: 27.1717 - val_r2_score: 0.8079 Epoch 21/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 34ms/step - loss: 26.3942 - r2_score: 0.7939 - val_loss: 27.0083 - val_r2_score: 0.8091 Epoch 22/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 26ms/step - loss: 26.2475 - r2_score: 0.7951 - val_loss: 26.8523 - val_r2_score: 0.8102 Epoch 23/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - loss: 26.1075 - r2_score: 0.7961 - val_loss: 26.7029 - val_r2_score: 0.8112 Epoch 24/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 24ms/step - loss: 25.9737 - r2_score: 0.7972 - val_loss: 26.5596 - val_r2_score: 0.8123 Epoch 25/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 24ms/step - loss: 25.8455 - r2_score: 0.7982 - val_loss: 26.4221 - val_r2_score: 0.8132
print("Time taken in seconds ",end-start)
Time taken in seconds 66.82818222045898
plot(history,'loss')
plot(history,'r2_score')
results.loc[3]=['-','-','-',epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
results
| | # hidden layers | # neurons - hidden layer | activation function - hidden layer | # epochs | batch size | optimizer | time(secs) | Train_loss | Valid_loss | Train_R-squared | Valid_R-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | - | - | 10 | 4814 | GD | 2.855393 | 116.842415 | 127.022896 | 0.077489 | 0.102093 |
| 1 | - | - | - | 25 | 4814 | GD | 4.981976 | 62.466640 | 66.866333 | 0.506804 | 0.527331 |
| 2 | - | - | - | 25 | 32 | SGD | 116.037931 | 25.865023 | 24.276419 | 0.795787 | 0.828393 |
| 3 | - | - | - | 25 | 64 | SGD | 66.828182 | 27.897743 | 26.422087 | 0.779738 | 0.813226 |
- The performance is marginally worse, but the training time has roughly halved since there are fewer updates per epoch.
- There is always a tradeoff here: performance vs. computation time.
Model 4¶
- Let's now add a hidden layer with 128 neurons.
- We'll use sigmoid as the activation function.
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
# Initializing the neural network
model = Sequential()
model.add(Dense(128, activation="sigmoid", input_dim=x_train.shape[1]))
model.add(Dense(1))
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ dense (Dense) │ (None, 128) │ 36,480 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_1 (Dense) │ (None, 1) │ 129 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 36,609 (143.00 KB)
Trainable params: 36,609 (143.00 KB)
Non-trainable params: 0 (0.00 B)
optimizer = keras.optimizers.SGD()  # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
epochs = 25
batch_size = 64
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - loss: 100.2605 - r2_score: 0.4408 - val_loss: 36.9329 - val_r2_score: 0.7389 Epoch 2/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 31ms/step - loss: 36.0253 - r2_score: 0.7199 - val_loss: 33.1919 - val_r2_score: 0.7654 Epoch 3/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 31.8483 - r2_score: 0.7519 - val_loss: 30.7305 - val_r2_score: 0.7828 Epoch 4/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 29.2322 - r2_score: 0.7721 - val_loss: 28.7463 - val_r2_score: 0.7968 Epoch 5/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - loss: 27.2142 - r2_score: 0.7876 - val_loss: 27.1094 - val_r2_score: 0.8084 Epoch 6/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 38ms/step - loss: 25.5121 - r2_score: 0.8008 - val_loss: 25.6944 - val_r2_score: 0.8184 Epoch 7/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 30ms/step - loss: 23.9985 - r2_score: 0.8125 - val_loss: 24.4229 - val_r2_score: 0.8274 Epoch 8/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 22.6255 - r2_score: 0.8231 - val_loss: 23.2575 - val_r2_score: 0.8356 Epoch 9/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 33ms/step - loss: 21.3786 - r2_score: 0.8328 - val_loss: 22.1796 - val_r2_score: 0.8432 Epoch 10/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 30ms/step - loss: 20.2531 - r2_score: 0.8415 - val_loss: 21.1771 - val_r2_score: 0.8503 Epoch 11/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 19.2443 - r2_score: 0.8494 - val_loss: 20.2422 - val_r2_score: 0.8569 Epoch 12/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 18.3453 - r2_score: 0.8563 - val_loss: 19.3709 - val_r2_score: 0.8631 Epoch 13/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 35ms/step - loss: 17.5467 - r2_score: 0.8626 - val_loss: 18.5624 - val_r2_score: 0.8688 Epoch 14/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 29ms/step - loss: 16.8377 - r2_score: 0.8681 - val_loss: 17.8170 - val_r2_score: 0.8741 Epoch 15/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step - loss: 16.2078 - r2_score: 0.8730 - val_loss: 17.1348 - val_r2_score: 0.8789 Epoch 16/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - 
loss: 15.6470 - r2_score: 0.8774 - val_loss: 16.5148 - val_r2_score: 0.8833 Epoch 17/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 15.1466 - r2_score: 0.8813 - val_loss: 15.9552 - val_r2_score: 0.8872 Epoch 18/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 30ms/step - loss: 14.6988 - r2_score: 0.8848 - val_loss: 15.4533 - val_r2_score: 0.8908 Epoch 19/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 14.2969 - r2_score: 0.8879 - val_loss: 15.0056 - val_r2_score: 0.8939 Epoch 20/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 13.9348 - r2_score: 0.8908 - val_loss: 14.6080 - val_r2_score: 0.8967 Epoch 21/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 13.6071 - r2_score: 0.8934 - val_loss: 14.2554 - val_r2_score: 0.8992 Epoch 22/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 33ms/step - loss: 13.3093 - r2_score: 0.8957 - val_loss: 13.9424 - val_r2_score: 0.9014 Epoch 23/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 36ms/step - loss: 13.0374 - r2_score: 0.8979 - val_loss: 13.6638 - val_r2_score: 0.9034 Epoch 24/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 46ms/step - loss: 12.7880 - r2_score: 0.8998 - val_loss: 13.4147 - val_r2_score: 0.9052 Epoch 25/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 32ms/step - loss: 12.5583 - r2_score: 0.9016 - val_loss: 13.1910 - val_r2_score: 0.9068
print("Time taken in seconds ",end-start)
Time taken in seconds 81.29296040534973
plot(history,'loss')
plot(history,'r2_score')
results.loc[4]=[1,128,'sigmoid',epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
results
| | # hidden layers | # neurons - hidden layer | activation function - hidden layer | # epochs | batch size | optimizer | time(secs) | Train_loss | Valid_loss | Train_R-squared | Valid_R-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | - | - | 10 | 4814 | GD | 2.855393 | 116.842415 | 127.022896 | 0.077489 | 0.102093 |
| 1 | - | - | - | 25 | 4814 | GD | 4.981976 | 62.466640 | 66.866333 | 0.506804 | 0.527331 |
| 2 | - | - | - | 25 | 32 | SGD | 116.037931 | 25.865023 | 24.276419 | 0.795787 | 0.828393 |
| 3 | - | - | - | 25 | 64 | SGD | 66.828182 | 27.897743 | 26.422087 | 0.779738 | 0.813226 |
| 4 | 1 | 128 | sigmoid | 25 | 64 | SGD | 81.292960 | 13.616706 | 13.191008 | 0.892491 | 0.906755 |
- We see a clear improvement in model performance (validation $R^2$ ~0.91).
- The training time has not increased drastically either.
Model 5¶
- We'll now change the activation for the hidden layer from sigmoid to tanh.
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
# Initializing the neural network
model = Sequential()
model.add(Dense(128, activation="tanh", input_dim=x_train.shape[1]))
model.add(Dense(1))
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ dense (Dense) │ (None, 128) │ 36,480 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_1 (Dense) │ (None, 1) │ 129 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 36,609 (143.00 KB)
Trainable params: 36,609 (143.00 KB)
Non-trainable params: 0 (0.00 B)
optimizer = keras.optimizers.SGD()  # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
epochs = 25
batch_size = 64
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 79.5473 - r2_score: 0.5943 - val_loss: 35.1616 - val_r2_score: 0.7514 Epoch 2/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 30.2468 - r2_score: 0.7651 - val_loss: 27.9162 - val_r2_score: 0.8027 Epoch 3/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 34ms/step - loss: 25.7403 - r2_score: 0.7997 - val_loss: 23.4933 - val_r2_score: 0.8339 Epoch 4/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 31ms/step - loss: 22.1494 - r2_score: 0.8272 - val_loss: 21.4471 - val_r2_score: 0.8484 Epoch 5/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 31ms/step - loss: 19.6231 - r2_score: 0.8466 - val_loss: 19.7989 - val_r2_score: 0.8600 Epoch 6/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 36ms/step - loss: 17.8089 - r2_score: 0.8607 - val_loss: 18.0529 - val_r2_score: 0.8724 Epoch 7/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 31ms/step - loss: 16.4196 - r2_score: 0.8715 - val_loss: 16.4278 - val_r2_score: 0.8839 Epoch 8/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - loss: 15.3018 - r2_score: 0.8802 - val_loss: 15.0046 - val_r2_score: 0.8939 Epoch 9/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 14.3681 - r2_score: 0.8874 - val_loss: 13.8965 - val_r2_score: 0.9018 Epoch 10/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 13.5662 - r2_score: 0.8936 - val_loss: 13.0059 - val_r2_score: 0.9081 Epoch 11/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 12.8818 - r2_score: 0.8989 - val_loss: 12.3403 - val_r2_score: 0.9128 Epoch 12/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 29ms/step - loss: 12.2779 - r2_score: 0.9036 - val_loss: 11.8500 - val_r2_score: 0.9162 Epoch 13/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 11.7520 - r2_score: 0.9077 - val_loss: 11.4965 - val_r2_score: 0.9187 Epoch 14/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 11.2884 - r2_score: 0.9113 - val_loss: 11.2293 - val_r2_score: 0.9206 Epoch 15/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 10.8749 - r2_score: 0.9146 - val_loss: 11.0060 - val_r2_score: 0.9222 Epoch 16/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 30ms/step - 
loss: 10.5038 - r2_score: 0.9175 - val_loss: 10.8015 - val_r2_score: 0.9236 Epoch 17/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 10.1706 - r2_score: 0.9201 - val_loss: 10.6096 - val_r2_score: 0.9250 Epoch 18/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - loss: 9.8682 - r2_score: 0.9225 - val_loss: 10.4366 - val_r2_score: 0.9262 Epoch 19/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 39ms/step - loss: 9.5903 - r2_score: 0.9247 - val_loss: 10.2879 - val_r2_score: 0.9273 Epoch 20/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 29ms/step - loss: 9.3318 - r2_score: 0.9267 - val_loss: 10.1541 - val_r2_score: 0.9282 Epoch 21/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 9.0828 - r2_score: 0.9287 - val_loss: 10.0276 - val_r2_score: 0.9291 Epoch 22/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 8.8365 - r2_score: 0.9306 - val_loss: 9.9145 - val_r2_score: 0.9299 Epoch 23/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 8.5971 - r2_score: 0.9325 - val_loss: 9.8223 - val_r2_score: 0.9306 Epoch 24/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - loss: 8.3698 - r2_score: 0.9343 - val_loss: 9.7550 - val_r2_score: 0.9310 Epoch 25/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 8.1562 - r2_score: 0.9360 - val_loss: 9.7151 - val_r2_score: 0.9313
print("Time taken in seconds ",end-start)
Time taken in seconds 77.24922800064087
plot(history,'loss')
plot(history,'r2_score')
results.loc[5]=[1,128,'tanh',epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
results
| | # hidden layers | # neurons - hidden layer | activation function - hidden layer | # epochs | batch size | optimizer | time(secs) | Train_loss | Valid_loss | Train_R-squared | Valid_R-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | - | - | 10 | 4814 | GD | 2.855393 | 116.842415 | 127.022896 | 0.077489 | 0.102093 |
| 1 | - | - | - | 25 | 4814 | GD | 4.981976 | 62.466640 | 66.866333 | 0.506804 | 0.527331 |
| 2 | - | - | - | 25 | 32 | SGD | 116.037931 | 25.865023 | 24.276419 | 0.795787 | 0.828393 |
| 3 | - | - | - | 25 | 64 | SGD | 66.828182 | 27.897743 | 26.422087 | 0.779738 | 0.813226 |
| 4 | 1 | 128 | sigmoid | 25 | 64 | SGD | 81.292960 | 13.616706 | 13.191008 | 0.892491 | 0.906755 |
| 5 | 1 | 128 | tanh | 25 | 64 | SGD | 77.249228 | 8.859550 | 9.715087 | 0.930051 | 0.931325 |
- Changing the activation to tanh has further improved the $R^2$ (validation ~0.93).
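A likely reason for the improvement (a general property of the two activations, not something verified from this specific run): tanh is zero-centered and its gradient is up to four times larger than sigmoid's, so the same SGD learning rate takes bigger effective steps. A minimal sketch comparing the derivatives at their steepest point:

```python
import math

def d_sigmoid(z):
    s = 1 / (1 + math.exp(-z))  # sigmoid
    return s * (1 - s)          # its derivative

def d_tanh(z):
    return 1 - math.tanh(z) ** 2  # derivative of tanh

# both curves are steepest at z = 0
print(d_sigmoid(0.0), d_tanh(0.0))  # 0.25 vs 1.0
```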
Model 6¶
- We'll now change the activation for the hidden layer from tanh to relu.
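ReLU is cheap to compute and does not saturate for positive inputs: its gradient is exactly 1 wherever the input is positive and 0 otherwise. A minimal sketch of the function and its derivative (the sample values are arbitrary):

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu = np.maximum(z, 0.0)      # relu(z) = max(0, z)
grad = (z > 0).astype(float)   # gradient: 0 for z <= 0, 1 for z > 0 (no saturation)
print(relu, grad)
```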
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
# Initializing the neural network
model = Sequential()
model.add(Dense(128, activation="relu", input_dim=x_train.shape[1]))
model.add(Dense(1))
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ dense (Dense) │ (None, 128) │ 36,480 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_1 (Dense) │ (None, 1) │ 129 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 36,609 (143.00 KB)
Trainable params: 36,609 (143.00 KB)
Non-trainable params: 0 (0.00 B)
optimizer = keras.optimizers.SGD()  # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
epochs = 25
batch_size = 64
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 35ms/step - loss: 78.1409 - r2_score: 0.6149 - val_loss: 22.6298 - val_r2_score: 0.8400
...
Epoch 25/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 9.9332 - r2_score: 0.9232 - val_loss: 14.3455 - val_r2_score: 0.8986
print("Time taken in seconds ",end-start)
Time taken in seconds 73.81564784049988
plot(history,'loss')
plot(history,'r2_score')
results.loc[6]=[1,128,'relu',epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
results
| | # hidden layers | # neurons - hidden layer | activation function - hidden layer | # epochs | batch size | optimizer | time (secs) | Train_loss | Valid_loss | Train_R-squared | Valid_R-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | - | - | 10 | 4814 | GD | 2.855393 | 116.842415 | 127.022896 | 0.077489 | 0.102093 |
| 1 | - | - | - | 25 | 4814 | GD | 4.981976 | 62.466640 | 66.866333 | 0.506804 | 0.527331 |
| 2 | - | - | - | 25 | 32 | SGD | 116.037931 | 25.865023 | 24.276419 | 0.795787 | 0.828393 |
| 3 | - | - | - | 25 | 64 | SGD | 66.828182 | 27.897743 | 26.422087 | 0.779738 | 0.813226 |
| 4 | 1 | 128 | sigmoid | 25 | 64 | SGD | 81.292960 | 13.616706 | 13.191008 | 0.892491 | 0.906755 |
| 5 | 1 | 128 | tanh | 25 | 64 | SGD | 77.249228 | 8.859550 | 9.715087 | 0.930051 | 0.931325 |
| 6 | 1 | 128 | relu | 25 | 64 | SGD | 73.815648 | 9.730357 | 14.345462 | 0.923175 | 0.898594 |
- The ReLU model did not show much improvement; its validation R-squared (~0.90) is slightly lower than the tanh model's (~0.93).
Model 7¶
- We will now add one more hidden layer with 32 neurons.
- We'll use relu activation in both hidden layers.
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
#Initializing the neural network
model = Sequential()
model.add(Dense(128,activation="relu",input_dim=x_train.shape[1]))
model.add(Dense(32,activation="relu"))
model.add(Dense(1))
model.summary()
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| dense (Dense) | (None, 128) | 36,480 |
| dense_1 (Dense) | (None, 32) | 4,128 |
| dense_2 (Dense) | (None, 1) | 33 |
Total params: 40,641 (158.75 KB)
Trainable params: 40,641 (158.75 KB)
Non-trainable params: 0 (0.00 B)
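The parameter counts in the summary can be verified by hand: a Dense layer with n inputs and m units has n·m weights plus m biases. A quick sanity check (assuming the 284 input features shown later in the notebook's feature list):

```python
def dense_params(n_inputs, n_units):
    # A fully connected layer has one weight per (input, unit) pair
    # plus one bias per unit.
    return n_inputs * n_units + n_units

n_features = 284  # number of columns in x_train (see the 284-row feature list later)

layer_params = [
    dense_params(n_features, 128),  # first hidden layer: 36,480
    dense_params(128, 32),          # second hidden layer: 4,128
    dense_params(32, 1),            # output layer: 33
]
print(layer_params, sum(layer_params))  # [36480, 4128, 33] 40641
```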
optimizer = keras.optimizers.SGD() # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics,run_eagerly=True)
epochs = 25
batch_size = 64
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val,y_val) , batch_size=batch_size, epochs=epochs)
end=time.time()
Epoch 1/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 39ms/step - loss: 119.0316 - r2_score: 0.3395 - val_loss: 174.7455 - val_r2_score: -0.2353
...
Epoch 25/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 6s 45ms/step - loss: 23.9112 - r2_score: 0.8142 - val_loss: 19.0233 - val_r2_score: 0.8655
print("Time taken in seconds ",end-start)
Time taken in seconds 104.9405345916748
plot(history,'loss')
plot(history,'r2_score')
results.loc[7]=[2,[128,32],['relu','relu'],epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
results
| | # hidden layers | # neurons - hidden layer | activation function - hidden layer | # epochs | batch size | optimizer | time (secs) | Train_loss | Valid_loss | Train_R-squared | Valid_R-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | - | - | 10 | 4814 | GD | 2.855393 | 116.842415 | 127.022896 | 0.077489 | 0.102093 |
| 1 | - | - | - | 25 | 4814 | GD | 4.981976 | 62.466640 | 66.866333 | 0.506804 | 0.527331 |
| 2 | - | - | - | 25 | 32 | SGD | 116.037931 | 25.865023 | 24.276419 | 0.795787 | 0.828393 |
| 3 | - | - | - | 25 | 64 | SGD | 66.828182 | 27.897743 | 26.422087 | 0.779738 | 0.813226 |
| 4 | 1 | 128 | sigmoid | 25 | 64 | SGD | 81.292960 | 13.616706 | 13.191008 | 0.892491 | 0.906755 |
| 5 | 1 | 128 | tanh | 25 | 64 | SGD | 77.249228 | 8.859550 | 9.715087 | 0.930051 | 0.931325 |
| 6 | 1 | 128 | relu | 25 | 64 | SGD | 73.815648 | 9.730357 | 14.345462 | 0.923175 | 0.898594 |
| 7 | 2 | [128, 32] | [relu, relu] | 25 | 64 | SGD | 104.940535 | 25.567276 | 19.023302 | 0.798137 | 0.865527 |
- Adding a hidden layer didn't improve the performance of the model.
Model Performance Comparison and Final Model Selection¶
results
| | # hidden layers | # neurons - hidden layer | activation function - hidden layer | # epochs | batch size | optimizer | time (secs) | Train_loss | Valid_loss | Train_R-squared | Valid_R-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | - | - | 10 | 4814 | GD | 2.855393 | 116.842415 | 127.022896 | 0.077489 | 0.102093 |
| 1 | - | - | - | 25 | 4814 | GD | 4.981976 | 62.466640 | 66.866333 | 0.506804 | 0.527331 |
| 2 | - | - | - | 25 | 32 | SGD | 116.037931 | 25.865023 | 24.276419 | 0.795787 | 0.828393 |
| 3 | - | - | - | 25 | 64 | SGD | 66.828182 | 27.897743 | 26.422087 | 0.779738 | 0.813226 |
| 4 | 1 | 128 | sigmoid | 25 | 64 | SGD | 81.292960 | 13.616706 | 13.191008 | 0.892491 | 0.906755 |
| 5 | 1 | 128 | tanh | 25 | 64 | SGD | 77.249228 | 8.859550 | 9.715087 | 0.930051 | 0.931325 |
| 6 | 1 | 128 | relu | 25 | 64 | SGD | 73.815648 | 9.730357 | 14.345462 | 0.923175 | 0.898594 |
| 7 | 2 | [128, 32] | [relu, relu] | 25 | 64 | SGD | 104.940535 | 25.567276 | 19.023302 | 0.798137 | 0.865527 |
Models 5 (tanh) and 6 (ReLU) achieved the highest training and validation scores among all the models.
We could choose either; we pick Model 6, as the small gap between its train and validation scores looks realistic rather than overly optimistic.
We'll go ahead with this model as our final model.
Let's rebuild it and check its performance across multiple metrics.
Final Model¶
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
#Initializing the neural network
model = Sequential()
model.add(Dense(128,activation="relu",input_dim=x_train.shape[1]))
model.add(Dense(1))
model.summary()
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| dense (Dense) | (None, 128) | 36,480 |
| dense_1 (Dense) | (None, 1) | 129 |
Total params: 36,609 (143.00 KB)
Trainable params: 36,609 (143.00 KB)
Non-trainable params: 0 (0.00 B)
optimizer = keras.optimizers.SGD() # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics,run_eagerly=True)
epochs = 25
batch_size = 64
history = model.fit(x_train, y_train, validation_data=(x_test,y_test) , batch_size=batch_size, epochs=epochs)
Epoch 1/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 31ms/step - loss: 78.9293 - r2_score: 0.5931 - val_loss: 15.2803 - val_r2_score: 0.8406
...
Epoch 25/25 76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 29ms/step - loss: 9.7241 - r2_score: 0.9246 - val_loss: 7.7915 - val_r2_score: 0.9187
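`model_performance` is a helper defined earlier in the notebook. A minimal sketch of what such a helper might compute (pure NumPy; the notebook's actual implementation may differ):

```python
import numpy as np

def model_performance(model, x, y):
    """RMSE, MAE, R-squared, adjusted R-squared, and MAPE (in %) for a fitted model."""
    pred = np.asarray(model.predict(x)).ravel()
    y = np.asarray(y).ravel()
    n, k = x.shape                                   # observations, predictors
    resid = y - pred
    rmse = np.sqrt(np.mean(resid ** 2))
    mae = np.mean(np.abs(resid))
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # penalizes the number of predictors
    mape = np.mean(np.abs(resid / y)) * 100
    return {"RMSE": [rmse], "MAE": [mae], "R-squared": [r2],
            "Adj. R-squared": [adj_r2], "MAPE": [mape]}
```

If the helper uses this standard adjusted R-squared formula, it also explains why the validation and test adjusted values below sit well under the plain R-squared: with roughly 284 predictors and only ~600 validation rows, the (n − 1)/(n − k − 1) penalty is large.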
train_perf = model_performance(model,x_train,y_train)
print("Train performance")
pd.DataFrame(train_perf)
151/151 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step Train performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 3.305341 | 1.563728 | 0.913741 | 0.908332 | 19.067221 |
x_val.isnull().sum()
| | 0 |
|---|---|
| Kilometers_Driven | 0 |
| Seats | 0 |
| New_Price | 0 |
| mileage_num | 0 |
| engine_num | 0 |
| ... | ... |
| Model_xylo | 0 |
| Model_yeti | 0 |
| Model_z4 | 0 |
| Model_zen | 0 |
| Model_zest | 0 |
284 rows × 1 columns
y_val.isnull().sum()
0
valid_perf = model_performance(model,x_val,y_val)
print("Validation data performance")
pd.DataFrame(valid_perf)
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step Validation data performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 3.588983 | 1.87963 | 0.908947 | 0.827374 | 19.785901 |
test_perf = model_performance(model,x_test,y_test)
print("Test performance")
pd.DataFrame(test_perf)
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step Test performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 2.791322 | 1.4971 | 0.918701 | 0.845865 | 18.660195 |
The model has an $R^2$ of ~0.92 on the test set, which means it can explain ~92% of the variance in the unseen data.
The RMSE is ~2.8, which means the model can predict the price of a used car within ~2.8 units of the actual value, and the MAE of ~1.5 means the average absolute error is ~1.5 units.
The MAPE is ~18.7%, which means the model's predictions deviate from the actual price by ~18.7% on average.
Business Insights and Recommendations¶
- Our neural network model explains approximately 92% of the variation in the unseen test data.
- Our analysis has revealed that certain factors, such as the year of manufacture, the number of seats, and the maximum power of the engine, tend to increase the price of a used car. Conversely, factors like the distance traveled and engine volume tend to decrease the price of a used car.
- Certain markets tend to have higher prices, and it would be beneficial for Cars4U to focus on these markets and establish offices in these areas if necessary.
- We need to gather data on the cost side of things before discussing profitability in the business.
- After analyzing the data, the next step would be to cluster the different data sets and determine whether we should create multiple models for different locations or car types.
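The clustering step suggested above could be sketched with scikit-learn's KMeans. The data and feature names below are purely illustrative (synthetic, not from the notebook); the idea is just that distinct car segments would fall into separate clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for two car segments on two scaled numeric features
# (e.g. power and engine size): budget cars low on both, luxury cars high.
rng = np.random.default_rng(1)
budget = rng.normal([0.2, 0.2], 0.05, size=(50, 2))
luxury = rng.normal([0.8, 0.8], 0.05, size=(50, 2))
features = np.vstack([budget, luxury])

# Two clusters; n_init set explicitly for stable results across sklearn versions.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(np.bincount(km.labels_))  # two clusters of ~50 cars each
```

In practice the number of clusters would be chosen with the elbow method or silhouette scores, and separate pricing models could then be fit per cluster or per location.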
Univariate Analysis¶
Kilometers_Driven¶
histogram_boxplot(df1, "Kilometers_Driven", bins=100, kde=True)
Observations
- This is another highly skewed distribution.
- Let us use log transformation on this column too.
df1["kilometers_driven_log"] = np.log(df1["Kilometers_Driven"])
histogram_boxplot(df1, "kilometers_driven_log", bins=100, kde=True)
- Transformation has reduced the extreme skewness.
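The improvement can be quantified with the sample skewness. A sketch on synthetic log-normal data (the notebook itself would simply compare `df1["Kilometers_Driven"].skew()` with `df1["kilometers_driven_log"].skew()`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an odometer column: log-normal, hence right-skewed.
rng = np.random.default_rng(42)
km_driven = pd.Series(np.exp(rng.normal(loc=11, scale=0.9, size=5000)))

print(round(km_driven.skew(), 2))          # strongly positive (right-skewed)
print(round(np.log(km_driven).skew(), 2))  # near zero after the log transform
```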
mileage_num¶
histogram_boxplot(df1, "mileage_num", kde=True)
Observations
- This attribute is approximately normally distributed.
engine_num¶
histogram_boxplot(df1, "engine_num", kde=True)
Observations
- There are a few cars with a higher engine displacement volume.
power_num¶
histogram_boxplot(df1, "power_num", kde=True)
Observations
- There are a few cars with a higher engine power.
# creating histograms
df.hist(figsize=(14, 14))
plt.show()
- Price: the target variable; it has a highly skewed distribution, so a log transformation was applied to reduce skewness. The engine displacement volume, the maximum engine power, and the price of a new car of the same model are highly correlated with the used-car price.
- Mileage: close to normally distributed. As mileage increases, engine displacement and power decrease.
- Engine: a few upper outliers, indicating a few cars with higher engine displacement volume. Higher-priced cars have higher displacement, which is also highly correlated with maximum engine power.
- Power: a few upper outliers, indicating a few cars with higher maximum power. Higher-priced cars have higher maximum power, which is also highly correlated with engine displacement.
- Kilometers_Driven: highly skewed, with a median of around 53.5 thousand km; a log transformation was applied to reduce skewness.
- New_Price: the price of a new car of the same model; highly skewed, with a median of around 11.3 lakh INR; a log transformation was applied to reduce skewness.
- Seats: 84% of the cars in the dataset are 5-seater cars.
- Year: more than half the cars in the data were manufactured in or after 2014. The price of used cars has increased over the years.
- Brand: most of the cars in the data belong to Maruti or Hyundai. The price of used cars is lower for budget brands like Maruti, Tata, Fiat, etc., and higher for premium brands like Porsche, Bentley, Lamborghini, etc.
- Model: Maruti Swift is the most common car up for resale. The dataset contains used cars from luxury as well as budget-friendly brands.
- Location: Hyderabad and Mumbai have the most demand for used cars. The price of used cars has a large IQR in Coimbatore and Bangalore.
- Fuel_Type: around 1% of the cars in the dataset do not run on diesel or petrol. Electric cars have the highest median price, followed by diesel cars.
- Transmission: more than 70% of the cars have manual transmission. The price is higher for used cars with automatic transmission.
- Owner_Type: more than 80% of the used cars are being sold for the first time. The price of cars decreases as they keep getting resold.
Model¶
labeled_barplot(df1, "Model", perc=True, n=10)
Observations
Maruti Swift is the most common car up for resale.
It is clear from the above charts that our dataset contains used cars from luxury as well as budget-friendly brands.
We can create a new variable using this information. We can consider binning all our cars into the following 3 categories later:
- Budget-Friendly
- Mid Range
- Luxury Cars
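Such a binning could be sketched with `pd.cut`. The price cut-offs below are hypothetical placeholders; the actual thresholds would be chosen from the price distribution per brand or model:

```python
import pandas as pd

# Hypothetical prices (in lakh INR) and illustrative cut-offs:
# up to 5 -> Budget-Friendly, 5-20 -> Mid Range, above 20 -> Luxury Cars.
prices = pd.Series([2.5, 4.0, 7.5, 12.0, 35.0, 60.0])
segment = pd.cut(prices, bins=[0, 5, 20, float("inf")],
                 labels=["Budget-Friendly", "Mid Range", "Luxury Cars"])
print(segment.tolist())
```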
Seats¶
labeled_barplot(df1, "Seats", perc=True)
- 84% of the cars in the dataset are 5-seater cars.
Year¶
labeled_barplot(df1, "Year", perc=True)
- More than half the cars in the data were manufactured in or after 2014.
Transmission¶
labeled_barplot(df1, "Transmission", perc=True)
- More than 70% of the cars have manual transmission.
Owner_Type¶
labeled_barplot(df1, "Owner_Type", perc=True)
- More than 80% of the used cars are being sold for the first time.
Bivariate Analysis¶
Let's check the variation in Price with some of the other variables.
Price vs Transmission¶
plt.figure(figsize=(5, 5))
sns.boxplot(x="Transmission", y="Price", data=df1)
plt.show()
- The price is higher for used cars with automatic transmission.
Price vs Fuel_Type¶
plt.figure(figsize=(18, 5))
sns.boxplot(x="Fuel_Type", y="Price", data=df1)
plt.show()
- Electric cars have the highest median price, followed by diesel cars.
Price vs Brand¶
plt.figure(figsize=(18, 5))
sns.boxplot(x="Brand", y="Price", data=df1)
plt.xticks(rotation=90)
plt.show()
- The price of used cars is lower for budget brands like Maruti, Tata, Fiat, etc.
- The price of used cars is higher for premium brands like Porsche, Audi, Lamborghini, etc.
Price vs Owner_Type¶
plt.figure(figsize=(18, 5))
sns.boxplot(x="Owner_Type", y="Price", data=df1)
plt.show()
- The price of cars decreases as they keep getting resold.
Pairplot for relations between numerical variables¶
sns.pairplot(data=df1, hue="Fuel_Type")
plt.show()
Zooming into these plots gives us a lot of information.
- Contrary to intuition, Kilometers_Driven does not seem to have a relationship with the price.
- Price has a positive relationship with Year, i.e., the newer the car, the higher the price. The temporal element of variation is captured in the Year column.
- 2-seater cars are all luxury variants. Cars with 8-10 seats are exclusively mid to high range.
- Mileage does not seem to show much relationship with the price of used cars.
- Engine displacement and power of the car have a positive relationship with the price.
- New_Price and used car price are also positively correlated, which is expected.
- Kilometers_Driven has a peculiar relationship with the Year variable. Generally, the newer the car, the lesser the distance it has traveled, but this is not always true.
- CNG cars are conspicuous outliers when it comes to Mileage; the mileage of these cars is very high.
- The mileage and power of newer cars are increasing owing to advancements in technology.
- Mileage has a negative correlation with engine displacement and power; the more powerful the engine, the more fuel it consumes in general.
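These pairwise relationships can also be confirmed numerically with a correlation matrix. A sketch on synthetic data that merely mirrors the signs described above (positive engine-power, negative mileage-power; the notebook would instead call `.corr()` on the real numeric columns of `df1`):

```python
import numpy as np
import pandas as pd

# Synthetic data mirroring the pairplot observations: power rises with
# engine displacement, while mileage falls as displacement grows.
rng = np.random.default_rng(0)
engine = rng.uniform(800, 3000, size=500)            # displacement in cc
power = 0.07 * engine + rng.normal(0, 15, size=500)  # bhp, noisy linear link
mileage = 28 - 0.005 * engine + rng.normal(0, 1, size=500)  # km per litre

corr = pd.DataFrame({"engine_num": engine, "power_num": power,
                     "mileage_num": mileage}).corr()
print(corr.round(2))
```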